BANNER-CHEMDNER: Incorporating Domain Knowledge in Chemical and Drug Named Entity Recognition

Authors

  • Tsendsuren Munkhdalai
  • Meijing Li
  • Khuyagbaatar Batsuren
  • Keun Ho Ryu
Abstract

Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining, due to the recent growth in the amount of biomedical literature. Named entity recognition is an essential prerequisite for effective text mining of biomedical literature. Participants in the CHEMDNER task of the BioCreative IV challenge are asked to develop a system that recognizes mentions of chemical compounds and drugs; the organizers provide a set of annotated PubMed documents for training and a set of selected PubMed documents for evaluation. Our primary goal is to develop a named entity recognition system that scales well to millions of documents and can easily be plugged into a biomedical text mining system, while exploiting unlabeled data to improve performance. As word representation features, we extracted Brown cluster labels and word embeddings, both induced from a large unlabeled document corpus, in addition to word features, word and character n-grams, and traditional orthographic features. During training, a second-order CRF model was built with varied feature spaces, and the influence of the different feature groups that define the experimental baselines was measured in terms of F-score, recall, precision, and processing time. Our system achieves F-scores of 81.41% and 82.58% on the CHEMDNER development set for the CEM and CDI sub-tasks, respectively, while processing ~530 documents per minute.
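
The feature groups named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the specific orthographic tests, the n-gram sizes, the Brown-path prefix lengths, and the toy cluster path are all assumptions chosen for illustration, and word embeddings are left out for brevity. It shows only how one token might be turned into a feature dictionary of the kind a CRF toolkit typically consumes.

```python
from typing import Dict, List, Mapping, Sequence


def char_ngrams(token: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Character n-grams of a token, with '<' and '>' marking its boundaries."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def orthographic_features(token: str) -> Dict[str, bool]:
    """A few surface-shape tests (hypothetical selection, not the paper's list)."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
    }


def token_features(
    tokens: Sequence[str],
    i: int,
    brown_paths: Mapping[str, str],
    prefix_lengths: Sequence[int] = (4, 6),  # assumed prefix lengths
) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    lower = token.lower()
    feats: Dict[str, object] = {"word": lower}
    feats.update(orthographic_features(token))

    # Word n-gram: here just a bigram with the previous token.
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["prev_word+word"] = f"{prev}|{lower}"

    # Character n-grams as binary indicator features.
    for gram in char_ngrams(lower):
        feats[f"char_ngram={gram}"] = True

    # Brown cluster label: prefixes of the word's bit-string path, if known.
    path = brown_paths.get(lower)
    if path is not None:
        for k in prefix_lengths:
            feats[f"brown_prefix_{k}"] = path[:k]
    return feats


# Toy usage: the cluster path below is made up, not induced from a real corpus.
toy_brown = {"aspirin": "1101001"}
example = token_features(["Aspirin", "inhibits", "COX-1"], 0, toy_brown)
```

Using prefixes of the Brown path rather than the full path lets the model back off to coarser clusters when the fine-grained cluster is rare; the lengths 4 and 6 here are arbitrary.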

Related resources

Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

BACKGROUND Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning m...

Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...

Comparison of different strategies for utilizing two CHEMDNER corpora

To identify chemical entities and drug names in patents for the CEMP subtask of the CHEMDNER patent task, we use a machine learning technique to construct a chemical named entity recognition (CNER) system. It is desirable for a machine-based CNER system to have a large number of training examples. Two CHEMDNER corpora have been developed. One is the corpus for the patent task and the other is the CHEMDNER corpus ...

CHEMDNER system with mixed conditional random fields and multi-scale word clustering

BACKGROUND Chemical compound and drug name recognition plays an important role in chemical text mining, and it is the basis for automatic relation extraction and event identification in chemical information processing. A high-performance named entity recognition system for chemical compound and drug names is therefore necessary. METHODS We developed a CHEMDNER system based on mixed conditional r...

Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization

BACKGROUND The functions of chemical compounds and drugs that affect biological processes, and their particular effects on the onset and treatment of diseases, have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound a...

Publication date: 2013